Many X Variables

Module 4.1: Multiple Regressors

Alex Cardazzi

Old Dominion University

Housing Price Models

In the last module, we developed quite a few models to explain housing prices. However, there are many factors that will impact the price of a house. Again, an important feature of economic models is that they’re relatively simple. However, our previous models were too simple. To make our model a bit more realistic, we can include other factors that might influence price. First, let’s read in the data, remove some columns, and change the names of the remaining columns.

Code
ames <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/ames.csv")
keep_columns <- c("price", "area", "Bedroom.AbvGr", "Year.Built", "Overall.Cond")
ames <- ames[,keep_columns]
colnames(ames) <- c("price", "sqft", "bedrooms", "yr_built", "condition")

In addition, we are going to create a new variable for age.

Code
ames$age <- 2011 - ames$yr_built

Housing Price Models

We’ll start by estimating a simple model where price is determined by age. Next, we’ll estimate a model where price is determined by age and square footage. Before we estimate anything, though, let’s first establish some of the correlations between the variables.

Code
cor(ames[,c("price", "age", "sqft")])
Output
           price        age       sqft
price  1.0000000 -0.5584261  0.7067799
age   -0.5584261  1.0000000 -0.2417258
sqft   0.7067799 -0.2417258  1.0000000
  1. The correlation between price and age is negative. Again, this makes sense since newer houses are typically worth more money.
  2. The correlation between price and square footage is positive. We saw this last module – as square footage increases, so does price. This makes sense because bigger houses are worth more money.
  3. The correlation between age and square footage is negative. This suggests that newer houses tend to be bigger! This feature of the data will turn out to be important.

Housing Price Models

Why is this so important? Let’s first estimate a simple model:

\[\text{Price}_i = \alpha_0 + \alpha_1 \times \text{Age}_i + \epsilon_i\]

Code
reg_alpha <- lm(price ~ age, ames)
cat("Alpha0:", coef(reg_alpha)[1], "; Alpha1:", coef(reg_alpha)[2])
Output
Alpha0: 239269.1 ; Alpha1: -1474.964

\(\alpha_1\) = -1474.96 suggests that for every one year increase in property age, housing price decreases by $1,474.96. However, when we are talking about an older house, we are also talking about a generally smaller home (as evidenced by the correlation). This means that the regression coefficient of -1474.96 is actually a composite of the effect of increased age and decreased square footage! In other words, when we compare houses of different ages, two things change at once: age increases and square footage decreases. So which effect are we really picking up?

Housing Price Models

Obviously, in reality, houses do not shrink as they age, but the model doesn’t know that! It only understands the world through the data and correlations it observes. In these data, older houses are also smaller, so it’s important to control for the size of a home to get a true measure of the effect of age on prices.

Another way to think about this is that you are trying to compare houses that are the same in every way except age. This would be like a Randomized Controlled Trial (RCT). “Controlling” for square footage gets us closer to mimicking an RCT by removing the effect of changes in square footage.

Housing Price Models

How do we control for square footage?

To get an estimate of the isolated effect of age on price, we need some extra information. First, we need the initial regression where price is a function of age. Next, we need to know how square footage will respond to changes in age. To get this information, we’ll estimate a regression where square footage is the outcome and age is the explanatory variable. Lastly, we need to know how price responds to changes in square footage.

Code
coef(lm(price ~ age, ames)); cat("\n")
coef(lm(sqft ~ age, ames)); cat("\n")
coef(lm(price ~ sqft, ames))
Output
(Intercept)         age 
 239269.065   -1474.964 

(Intercept)         age 
1659.855230   -4.040108 

(Intercept)        sqft 
  13289.634     111.694 

Housing Price Models

The results:

  1. When we increase age by 1, sale price drops by $1,474.96. Remember, this is a composite effect.
  2. When we increase age by 1, square footage drops by 4.04 square feet.
  3. How much would the price decrease because of the associated reduction in square footage? An extra square foot adds $111.69 to the price, and a one-year increase in age reduces square footage by 4.04. Multiplying these two numbers together gives us -$451.26. You can think of this as the “indirect” effect of decreasing property size due to increasing age.
  4. The effect of age alone would then be the composite effect (-$1,474.96) minus the effect attributable to the change in square footage (-$451.26). This comes out to be: -$1,023.70.
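
The arithmetic in these steps can be sketched in R using the coefficient estimates reported above (the numbers are copied directly from the three regressions):

```r
# Decomposition sketch using the coefficients reported above
composite  <- -1474.964  # effect of age on price (price ~ age)
age_sqft   <- -4.040108  # effect of age on sqft  (sqft ~ age)
sqft_price <- 111.694    # effect of sqft on price (price ~ sqft)

indirect <- age_sqft * sqft_price  # indirect effect via shrinking size
direct   <- composite - indirect   # effect of age alone

round(indirect, 2)  # -451.26
round(direct, 2)    # -1023.71
```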

Housing Price Models

This is a rough approximation because each of those regressions is biased for the same reason that the first one is. So, rather than estimating three regressions and doing the algebra by hand, we can do this all in one step. We are going to estimate the following model:

\[\text{Price}_i = \beta_0 + \beta_1 \times \text{Age}_i + \beta_2 \times \text{Sq. Ft.}_i + \epsilon_i\]

Without going through all the math, OLS will simultaneously find two lines of best fit. In fact, the fit becomes a surface, or a plane, since there are two explanatory dimensions instead of just one. \(\beta_0\), \(\beta_1\), and \(\beta_2\) will still minimize the sum of squared residuals and provide an average residual of zero.

Now let’s estimate the model in R. To include another variable in lm(), simply add it to the right side of the formula like you would a math equation.

Housing Price Models

Code
reg_beta <- lm(price ~ age + sqft, ames)
coef(reg_beta)
Output
(Intercept)         age        sqft 
79973.60078 -1087.23673    95.96949 
  1. \(\beta_0\)’s interpretation is now the price of a property where all \(X\) variables are equal to zero: in this case, the price of a house with an age of zero and zero square footage. Of course, a house like this does not exist, so \(\beta_0\) is not very meaningful.
  2. \(\beta_1\) is the effect of one additional year of age on price. The model has removed the effect of square footage from this coefficient since we have “controlled” for it by adding square footage into the model.
  3. \(\beta_2\) is the effect of an additional square foot of size on final sale price, holding age constant.

This model also makes intuitive sense: it says that older houses are worth less and larger houses are worth more.
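
One way to see why the one-step model produces the “cleaned” age coefficient is the Frisch-Waugh-Lovell result: the multiple-regression coefficient on a variable equals the slope from regressing the outcome on the part of that variable the other controls cannot explain. A quick sketch on simulated data (the names and numbers here are illustrative, not from the ames data):

```r
set.seed(42)
n <- 500
x2 <- rnorm(n)                      # plays the role of sqft
x1 <- -0.3 * x2 + rnorm(n)          # plays the role of age; correlated with x2
y  <- 5 + 2 * x1 + 3 * x2 + rnorm(n)

b_multi <- coef(lm(y ~ x1 + x2))["x1"]      # multiple-regression coefficient
x1_resid <- resid(lm(x1 ~ x2))              # part of x1 uncorrelated with x2
b_fwl <- coef(lm(y ~ x1_resid))["x1_resid"] # same number, recovered in steps

all.equal(unname(b_multi), unname(b_fwl))   # TRUE
```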

Housing Price Models

Let’s practice with another variable: bedrooms. Let’s estimate a model of the form:

\[\text{Price}_i = \gamma_0 + \gamma_1 \times \text{Sq. Ft.}_i + \gamma_2 \times \text{Bedrooms}_i + \epsilon_i\]

  1. We should expect that \(\gamma_1>0\) since bigger houses should be more valuable.
  2. We should also expect \(\gamma_2>0\) since houses with more bedrooms should also be more valuable.

Housing Price Models

Let’s estimate three models:

Code
r1 <- lm(price ~ sqft, ames)
r2 <- lm(price ~ bedrooms, ames)
r3 <- lm(price ~ sqft + bedrooms, ames)

regz <- list(`Price` = r1,
             `Price` = r2,
             `Price` = r3)
coefz <- c("sqft" = "Square Footage",
           "bedrooms" = "Bedrooms",
           "(Intercept)" = "Constant")
gofz <- c("nobs", "r.squared")
library("modelsummary")
modelsummary(regz,
             title = "Effect of Sq. Ft. and Bedrooms on Sale Price",
             estimate = "{estimate}{stars}",
             coef_map = coefz,
             gof_map = gofz)
Effect of Sq. Ft. and Bedrooms on Sale Price
                      (1) Price        (2) Price        (3) Price
Square Footage      111.694***                        136.361***
                    (2.066)                           (2.247)
Bedrooms                            13889.495***    −29149.110***
                                    (1765.042)       (1372.135)
Constant          13289.634***     141151.743***     59496.236***
                  (3269.703)        (5245.395)       (3741.249)
Num.Obs.             2930              2930              2930
R2                   0.500             0.021             0.566

Housing Price Models

In the third model, we find:

  1. Each additional square foot increases price by $136.36.
  2. Each additional bedroom decreases price by $29,149.11.

Wait a minute! More bedrooms decrease sale price? Initially, this might be confusing, so let’s think about this carefully. In words, this coefficient’s interpretation is:

Increasing the number of bedrooms by one, holding square footage constant, decreases sale price by $29,149.11.

Housing Price Models

What does it mean to increase the number of bedrooms in a property while holding square footage constant? Suppose a home is 2,000 square feet with four rooms. Each room is 500 square feet (on average). If we add an additional room, each room would only be 400 square feet (on average). Therefore, adding an extra bedroom, while holding square footage constant, makes for a bunch of small rooms. This is not something that is typically sought after on the housing market, hence the negative coefficient.

Housing Price Models

Below is a progression of three models:

Code
r1 <- lm(log(price) ~ log(sqft), ames)
r2 <- lm(log(price) ~ log(sqft) + bedrooms, ames)
r3 <- lm(log(price) ~ log(sqft) + bedrooms + age, ames)
regz <- list(`log(Price)` = r1,
             `log(Price)` = r2,
             `log(Price)` = r3)
coefz <- c("log(sqft)" = "log(Square Footage)",
           "bedrooms" = "Bedrooms",
           "age" = "Age")
gofz <- c("nobs", "r.squared")
library("modelsummary")
modelsummary(regz,
             title = "Determinants of Sale Price",
             estimate = "{estimate}{stars}",
             coef_map = coefz,
             gof_map = gofz)
Determinants of Sale Price
                      (1) log(Price)  (2) log(Price)  (3) log(Price)
log(Square Footage)     0.908***        1.094***        0.875***
                        (0.016)         (0.018)         (0.015)
Bedrooms                               −0.138***       −0.082***
                                        (0.007)         (0.006)
Age                                                    −0.006***
                                                        (0.000)
Num.Obs.                2930            2930            2930
R2                      0.523           0.580           0.731
Coefficient Interpretation for Model 3
  1. A 1% increase in square footage, holding bedrooms and age constant, increases sale price by 0.875%.
  2. An additional bedroom, holding age and square footage constant, reduces sale price by 8.2%.
  3. An additional year of age, holding square footage and bedrooms constant, reduces sale price by 0.6%.
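
A caveat worth flagging (this is an addition, not part of the table above): reading log-outcome coefficients as percent changes is an approximation that works well for small coefficients; the exact change for a one-unit increase is \(e^{\beta} - 1\). For the bedrooms coefficient:

```r
# Exact percent change implied by a log-outcome coefficient
b_bedrooms <- -0.082
exact <- exp(b_bedrooms) - 1
round(100 * exact, 1)  # -7.9, close to the -8.2% approximation
```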

Housing Price Models

Take a look at how the \(R^2\) changes from model to model. Each new control variable adds a little more information, which improves the model’s ability to explain variation in price. As a way to visualize this, we can plot the fitted values against the true, observed values. If the model were perfect, all the points would fall on the red 45° (\(y = x\)) line.

Code
par(mfrow = c(1, 3))
ylim <- exp(range(r1$fitted.values, r2$fitted.values, r3$fitted.values))
plot(ames$price, exp(r1$fitted.values), ylim = ylim,
     ylab = "Fitted/Predicted Price", xlab = "",
     main = "Controls: sqft",
     col = scales::alpha("black", 0.2), pch = 19)
abline(0, 1, col = "tomato")
legend("bottomright", bty = "n", cex = 2,
       legend = paste0("R2: ", round(summary(r1)$r.squared, 3)))
plot(ames$price, exp(r2$fitted.values), ylim = ylim,
     xlab = "Observed Price", ylab = "",
     main = "Controls: sqft + bedrooms",
     col = scales::alpha("black", 0.2), pch = 19)
abline(0, 1, col = "tomato")
legend("bottomright", bty = "n", cex = 2,
       legend = paste0("R2: ", round(summary(r2)$r.squared, 3)))
plot(ames$price, exp(r3$fitted.values), ylim = ylim,
     ylab = "", xlab = "",
     main = "Controls: sqft + bedrooms + age",
     col = scales::alpha("black", 0.2), pch = 19)
abline(0, 1, col = "tomato")
legend("bottomright", bty = "n", cex = 2,
       legend = paste0("R2: ", round(summary(r3)$r.squared, 3)))
Plot

Three plots of observed vs. fitted prices. The correlation between the two increases in each plot from left to right.

As control variables are included, the \(R^2\) increases and the points start to get tighter to the diagonal line. This means that the predicted values are getting closer to the actual values.

A note about \(R^2\) You should not choose which variables are (or are not) important based on changes in \(R^2\). Technically, it is impossible for \(R^2\) to decrease after you add another variable. Of course, variables that add a lot of explanatory power to your model should be considered, but \(R^2\) does not tell you which model is best. As we will see later, there are trade-offs faced when including/excluding variables, so you should rely on theory and intuition to guide your modeling decisions.

Points that are above the 45° line were expected by the model to sell for more than they actually did. Points below the line sold for higher prices than what the model predicted.
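
The claim that \(R^2\) cannot decrease when a variable is added can be verified with a quick simulation (a sketch; all of the data here are made up):

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2 * x + rnorm(n)
noise <- rnorm(n)   # pure noise, unrelated to y by construction

r2_small <- summary(lm(y ~ x))$r.squared
r2_big   <- summary(lm(y ~ x + noise))$r.squared
r2_big >= r2_small  # TRUE: even a useless variable cannot lower R^2
```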

Zillow

Imagine you are a data scientist for Zillow, and you can observe characteristics about homes for sale and the list price. You could estimate a model to explain list price with square footage, bedrooms, age, size of the backyard, whether the property has a pool, etc. Then, for each property, you could generate a fitted sale price. If the fitted price is above the list price, this represents an underpriced home for sale (and an investment opportunity). On the other hand, a fitted price below a list price represents an overpriced home.
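
A minimal sketch of that flagging logic (the data frame and every number in it are hypothetical):

```r
# Hypothetical listings: compare fitted prices to list prices
homes <- data.frame(
  list_price = c(250000, 310000, 199000, 450000, 275000, 330000),
  sqft       = c(1800, 2400, 1500, 3200, 2000, 2500),
  bedrooms   = c(3, 4, 3, 5, 3, 4)
)
fit <- lm(list_price ~ sqft + bedrooms, homes)
homes$fitted_price <- fitted(fit)
# Fitted above list: candidate "underpriced" homes
homes$underpriced <- homes$fitted_price > homes$list_price
```

In practice, a model like Zillow’s would use far more characteristics and observations; this only illustrates the comparison of fitted values to list prices.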

Zillow

In 2018, Zillow started “Zillow Offers”, a division of their company dedicated to buying and selling homes based on pretty much the exact logic outlined above. However, Zillow failed in a major way.

Why? What happened? There were ultimately a lot of reasons for why this did not work, but a major reason was omitted variable bias. As an extreme, facetious example, consider 100 properties that are all identical. Of course, Zillow’s model will therefore predict the exact same sale price for each of these homes. However, the list prices of these homes will certainly not be the same. First, people are not robots and have different circumstances. This is going to introduce random noise into the list prices. Also, homeowners know a lot about their own homes compared to what Zillow can observe in data. This is going to introduce non-random noise into the list price.

Zillow

Zillow was betting on the random part outweighing the non-random part. For example, suppose Zillow identifies the five most under-priced homes out of the lot of one hundred. These homes could be under-priced because the homeowners need to move ASAP for one reason or another and are willing to sell below market value. Or, these homes could be under-priced because the home is infested with termites and smells like a sewer. Not that Zillow was buying a bunch of termite traps, but their models were unable to control for everything that impacts the price of a home, and they were selecting homes that were under-priced for non-random reasons.

NFTs

Of course, there are markets where this sort of logic can work as intended. During the pandemic (spring 2021), NFTs gained a lot of popularity. Specifically, NBA TopShot was incredibly popular.

NBA TopShot was a marketplace for the internet equivalent of basketball cards. The NFTs being traded (bought and sold) were highlights of basketball players dunking, shooting, defending, etc. Each highlight had a limited number released. For example, this highlight of Keldon Johnson only had 12,000 official copies. Prices of these NFTs depended on the player, the highlight, the rarity, and the serial number (#1 being the most expensive and #12,000 being the least expensive).

A clever econometrician could estimate a model for sale price and compare the fitted/predicted values to the listed prices of each NFT. NFTs with a list price under the predicted sale price could then be bought at a discount and immediately re-listed for the predicted sale price. Arbitrage.

NFTs

Let’s use example data from a single NFT from 3/15/2021. By limiting it to only one NFT highlight, the prices are now completely determined by serial number. Below is a plot of serial number and list price.

Plot

Serial Number vs Price for a Joel Embiid NFT.

This plot shows that most NFTs are relatively cheap with some very expensive ones. To get a clearer picture of the relationship, we can plot the data in a log-log format.

Plot

log Serial Number vs log Price for a Joel Embiid NFT.

It would be unreasonable to purchase an NFT with serial number \(s\) for price \(p\) if you could purchase an NFT with a lower serial number \(s-\epsilon\) for the same price \(p\) (or lower). Remember, the lower the serial number, the higher the value! So, let’s only consider NFTs where we cannot get a lower serial number without paying more. These will be colored in red.
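
This filtering rule can be sketched in R (the serial numbers and prices below are made up; the real data are not shown here):

```r
# Hypothetical listings for one highlight, sorted by serial number
nft <- data.frame(serial = c(10, 50, 138, 500, 900),
                  price  = c(1200, 700, 494, 300, 350))
nft <- nft[order(nft$serial), ]
# A listing is "reasonable" if no lower serial is available at the
# same or lower price, i.e. its price sets a new running minimum.
reasonable <- nft$price <= cummin(nft$price)
nft[reasonable, ]  # drops the serial-900 listing: serial 500 is cheaper
```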

Plot

Serial Number vs Price for a Joel Embiid NFT.

NFTs

Once we have these “reasonable” NFTs picked out from the rest, we can fit a log-log regression through the points. The estimated parameters of this regression are:

Output
    (Intercept) logSerialNumber 
      9.0936344      -0.5453862 
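
Plugging these coefficients back into the log-log form gives a predicted price for any serial number (a back-of-the-envelope sketch; the function name is my own):

```r
# Predicted price implied by the log-log fit above
intercept <- 9.0936344
slope     <- -0.5453862
predict_price <- function(serial) exp(intercept + slope * log(serial))
predict_price(138)  # roughly $600 for serial number 138
```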

Now we are in a very similar situation as Zillow except with one key difference: all of these NFTs are the same! If we find an undervalued NFT, it is not because it is infested with termites or because it smells like a sewer. It is purely due to randomness in pricing by individuals.

We can visualize this relationship below. Points that are below the line are under-priced. Points above the line are over-priced. The distance from the line tells us the severity of the under- or over-pricing. We want to identify the point that is the farthest below the line so we can buy it and re-sell it.

Plot

Serial Number vs Price for a Joel Embiid NFT.

NFTs

The NFT that is farthest from the line is the circled one. Its serial number is 138, and it was on sale for $494. The model wants to re-sell it for $599.

Plot

Serial Number vs Price for a Joel Embiid NFT.

Of course, this sort of model is hypothetical and old news (the TopShot market has since crashed). However, the point I am trying to make is that Zillow omitted important variables in their models and the results were disastrous.

Omitted Variables

Hopefully by now, you’re convinced that omitted variable bias is a problem for prediction. Also, after seeing what happened in the age and square footage example, hopefully you can see that omitted variables can also bias our coefficient interpretation.

Unfortunately, sometimes we cannot observe every variable that we’d like to control for in our regression. However, we can sign the bias of our coefficients if we can anticipate correlation coefficients.

Consider the age and square footage example. Initially, our omitted variable was square footage. We knew that square footage was positively correlated with price but negatively correlated with age. So, when we put square footage into the model, the coefficient for age went from \(-1474.96\) to \(-1087.24\). So, the omitted variable bias was negative since it was pushing the coefficient down in the negative direction.

The same thing happened with square footage and bedrooms. Holding square footage constant, bedrooms have a negative effect on price, but bedrooms and square footage are positively correlated. So, when bedrooms were added, our coefficient moved from \(111.69\) to \(136.36\), meaning the omitted variable was again pushing the coefficient downward. Therefore, omitting bedrooms also caused negative omitted variable bias in this regression.

Omitted Variables

If you have a model \(Y = \delta_0 + \delta_1\times A + \epsilon\) and \(B\) is your omitted variable, the bias in \(\delta_1\) will be:

Omitted Variable Bias Table
                                  cor(A, B) > 0    cor(A, B) < 0
B’s effect on Y is positive           \(+\)            \(-\)
B’s effect on Y is negative           \(-\)            \(+\)

Remember: positive (negative) bias means the estimated coefficient is artificially pushed upwards (downwards).
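
The signs in the table follow from the standard omitted variable bias formula. If the true model is \(Y = \delta_0 + \delta_1 A + \delta_2 B + \epsilon\) but \(B\) is omitted, the estimated slope satisfies (a sketch, using the usual population-level result):

\[\hat{\delta}_1 \rightarrow \delta_1 + \delta_2 \times \frac{\text{Cov}(A, B)}{\text{Var}(A)}\]

Since \(\text{Var}(A) > 0\), the bias term takes the sign of \(\delta_2\) times the sign of \(\text{cor}(A, B)\), which is exactly what the table’s four cells summarize.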

Unimportant Variables

Now we know the effect of omitting important variables on both our predictions and coefficient interpretations. But what is the effect of the opposite: including unimportant variables?

Including unimportant variables will not bias your other coefficients because if they’re truly unimportant, we would expect \(\beta_i = 0\). Take a look at the above table: if \(B\) does not affect \(Y\) whatsoever, then the bias will be neither positive nor negative regardless of \(B\)’s correlation with \(A\).

If this is the case, why don’t we throw every single variable we can think of into the model?! Well, there’s no such thing as a free lunch. Each time we include a new variable, especially when the new variable is highly correlated with one or more other variables, we lose a little bit of our precision. In other words, including new variables generally increases the standard errors of the coefficients for the other variables in the model. This makes it more difficult to effectively test our hypotheses.
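
This precision cost can be demonstrated with a quick simulation (a sketch with made-up variables): the standard error on \(x_1\) grows once a highly correlated but unimportant control enters the model.

```r
set.seed(7)
n <- 300
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)  # highly correlated with x1
y  <- 1 + x1 + rnorm(n)                      # x2 has no true effect on y

se_alone <- coef(summary(lm(y ~ x1)))["x1", "Std. Error"]
se_both  <- coef(summary(lm(y ~ x1 + x2)))["x1", "Std. Error"]
se_both > se_alone  # TRUE: the redundant control inflates the SE on x1
```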

Multicollinearity

Intuitively, adding two variables into a model that are highly correlated supplies the model with overlapping and redundant information. This makes it difficult for the model to parse out the effect of each variable individually. The problem of redundant information in a model is called multicollinearity. If two variables contain exactly the same information, the regression cannot differentiate between the two variables and therefore cannot be estimated without removing one of the variables. This is called perfect multicollinearity.

Multicollinearity

Multicollinearity occurs for one of two reasons:

  1. Structural: Sometimes we create variables based on already existing variables. For example, we created an “age” variable from the “year built” variable. Year built and age are highly correlated. In fact, since one is a linear function of the other, including both in a regression will create perfect multicollinearity. Put differently, since the correlation between the two is -1, they contain exactly the same information.
Age and Year Built in a Regression
Code
lm(price ~ age + yr_built, ames)
Output

Call:
lm(formula = price ~ age + yr_built, data = ames)

Coefficients:
(Intercept)          age     yr_built  
     239269        -1475           NA  
Notice how the resulting “yr_built” coefficient is NA. This is because R could not estimate it and drops it from the model.
  2. Data-Based: Other times, variables are correlated simply because we are working with observational data. For example, in reality, the number of hours of SAT tutoring a student receives is highly correlated with their family income. If we had an RCT that randomized the amount of tutoring students got, we wouldn’t have to worry about parental income. However, in an observational setting, this is just a fact of the matter.

Multicollinearity

To demonstrate the effects of multicollinearity on model output, we are going to simulate some data. Basically, we are going to simulate the following data generating process:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon\]

where \(\beta_0 = 0\) and \(\beta_1=\beta_2=1\).

The first time we run the model, \(X_1\) and \(X_2\) will be completely uncorrelated. However, we are going to use a for loop to re-run the model over and over, increasing the correlation between the two variables after each iteration. The code is not necessarily important, but you can still hopefully follow along.

Code
set.seed(757)
N <- 100 # sample size
x1 <- rnorm(N, 0, 1) # create X1
e <-  rnorm(N, 0, 3) # create unobserved error
all1 <- list()
for(a in seq(0, .99, 0.01)){
  
  # X2 is a combination of randomness and X1
  # When a = 0, X2 and X1 are completely uncorrelated
  x2 <- ((1-a)*rnorm(N, 0, 1)) + (a*x1)
  rho <- cor(x1, x2) # calculate correlation
  y <- x1 + x2 + e # Data Generating Process
  # Estimate regression and store coefficients
  tmp <- as.data.frame(coef(summary(lm(y ~ x1 + x2))))
  # Store correlation
  tmp$a <- rho
  all1[[length(all1) + 1]] <- tmp # save results
}

# Combine results
all <- do.call(rbind, all1)
# Round off correlation
all$a <- round(all$a, 2)

layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE))
par(mar = c(4.1, 4.1, 1.1, 1.1))
plot(all$a[c(T,F,F)], las = 1, pch = 19,
     col = scales::alpha("black", 0.6), cex = 1.25,
     xlab = "Loop Iteration Number",
     ylab = "Corr. Between X1 and X2")
abline(h = c(0, 0.5, 0.9), lty = 2)
plot(all$a[c(F,T,F)], all$Estimate[c(F,T,F)], las = 1,
     xlab = "Correlation", cex = 1.25,
     ylab = "Coefficient Estimate", pch = 19,
     ylim = range(all$Estimate[c(F,T,F)][-which(all$Estimate[c(F,T,F)] == max(all$Estimate[c(F,T,F)]))]),
     col = scales::alpha("black", 0.6))
abline(h = 1)
plot(all$a[c(F,T,F)], all$`t value`[c(F,T,F)], las = 1,
     xlab = "Correlation", cex = 1.25,
     ylab = "t-Statistic", pch = 19,
     col = scales::alpha("black", 0.6))
abline(h = 1.96, lty = 2)
layout(matrix(c(1), 1, 1, byrow = TRUE))
Plot

Three images that demonstrate the effects of multicollinearity on coefficient estimates and standard errors.

Multicollinearity

Examining these three figures:

  1. As expected, the first few iterations have a very low correlation coefficient, hovering between 0 and 0.2. Then, the correlation coefficient reaches 0.5 by around iteration 40. Finally, the correlation reaches 0.9 by iteration 70 or so.
  2. Up until the correlation reaches roughly 0.95, the coefficient estimate is extremely stable and hovers around 1.
  3. However, the t-statistic signals that the regression is unable to statistically distinguish the coefficient estimate (which is about 1) from 0 at the 95% level once the correlation reaches around 0.5.

All three of these plots provide evidence that multicollinearity does not bias coefficients (until it becomes extreme), but does hinder the precision of the estimates.

Choosing Variables

To summarize the effect of including (or not including) additional regressors:

  1. Omitting important variables biases coefficient estimates and model predictions.
  2. Including unimportant variables reduces the precision of the coefficient estimates.

The way you should think about adding new variables to your model is through a bias-variance tradeoff. We know that coefficients can be biased if we omit important variables, but our standard errors get larger as the number of parameters we have to estimate increases.

It’s important to consider only the most important variables in your model to minimize both bias and variance. You might ask: how can I tell which variables are most important? To that I would say: good question, and the best answer is economic theory.

Choosing Variables

Economic theory should guide your choices about which variables to include/exclude and which functional forms you should use (log(), x + x^2, etc.). Unfortunately, this might seem like an unsatisfying answer, but econometrics is as much an art as it is a science. As you become more comfortable with econometrics, selecting variables and functional forms will become more natural.

In the next portion of the module, we will discuss how to incorporate binary/categorical variables into our models.